- 
                Notifications
    You must be signed in to change notification settings 
- Fork 15k
[Clang] Do not warn on UTF-16 -> UTF-32 conversions. (#163927) #164654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @llvm/pr-subscribers-clang Author: Corentin Jabot (cor3ntin) ChangesUTF-16 to UTF-16 conversions seems widespread, Lets not warn on this case to make the warning easier to adopt. This follows SG-16 guideline https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1 Fixes #163719 Full diff: https://github.com/llvm/llvm-project/pull/164654.diff 2 Files Affected: 
 diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index dd5b710d7e1d4..41bcf8fd493fc 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -12014,13 +12014,20 @@ static void DiagnoseMixedUnicodeImplicitConversion(Sema &S, const Type *Source,
                                                    SourceLocation CC) {
   assert(Source->isUnicodeCharacterType() && Target->isUnicodeCharacterType() &&
          Source != Target);
+
+  // Lone surrogates have a distinct representation in UTF-32.
+  // Converting between UTF-16 and UTF-32 codepoints seems very widespread,
+  // so don't warn on such conversion.
+  if (Source->isChar16Type() && Target->isChar32Type())
+    return;
+
   Expr::EvalResult Result;
   if (E->EvaluateAsInt(Result, S.getASTContext(), Expr::SE_AllowSideEffects,
                        S.isConstantEvaluatedContext())) {
     llvm::APSInt Value(32);
     Value = Result.Val.getInt();
     bool IsASCII = Value <= 0x7F;
-    bool IsBMP = Value <= 0xD7FF || (Value >= 0xE000 && Value <= 0xFFFF);
+    bool IsBMP = Value <= 0xDFFF || (Value >= 0xE000 && Value <= 0xFFFF);
     bool ConversionPreservesSemantics =
         IsASCII || (!Source->isChar8Type() && !Target->isChar8Type() && IsBMP);
 
diff --git a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
index fcff006d0e028..f17f20ca25295 100644
--- a/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
+++ b/clang/test/SemaCXX/warn-implicit-unicode-conversions.cpp
@@ -14,7 +14,7 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c16(u32); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' may lose precision and change the meaning of the represented code unit}}
 
     c32(u8);  // expected-warning {{implicit conversion from 'char8_t' to 'char32_t' may change the meaning of the represented code unit}}
-    c32(u16); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' may change the meaning of the represented code unit}}
+    c32(u16);
     c32(u32);
 
 
@@ -30,7 +30,7 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c16(char32_t(0x7f));
     c16(char32_t(0x80));
     c16(char32_t(0xD7FF));
-    c16(char32_t(0xD800)); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code unit '<0xD800>'}}
+    c16(char32_t(0xD800));
     c16(char32_t(0xE000));
     c16(char32_t(U'🐉')); // expected-warning {{implicit conversion from 'char32_t' to 'char16_t' changes the meaning of the code point '🐉'}}
 
@@ -44,8 +44,8 @@ void test(char8_t u8, char16_t u16, char32_t u32) {
     c32(char16_t(0x80));
 
     c32(char16_t(0xD7FF));
-    c32(char16_t(0xD800)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xD800>'}}
-    c32(char16_t(0xDFFF)); // expected-warning {{implicit conversion from 'char16_t' to 'char32_t' changes the meaning of the code unit '<0xDFFF>'}}
+    c32(char16_t(0xD800));
+    c32(char16_t(0xDFFF));
     c32(char16_t(0xE000));
     c32(char16_t(u'☕'));
 
 | 
| 
 UTF-16 to UTF-32, right? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM aside from the PR summary, should have a release note (perhaps on the release branch).
| 
 sidenote: updated release notes on the maintenance branches almost never get published (though they should). 
 It would help immensely if the release notes were built and published in an automated fashion per branch (some prior thoughts on this), rather than being dependent on the availability and goodwill of the release managers. Though I realize this would be a big initial lift. | 
| 
 I'm new to release maintenance so this is new to me, thanks for mentioning. I'll raise it with the other release maintainers. | 
UTF-16 to UTF-16 conversions seems widespread, and lone surrogate have a distinct representation in UTF-32. Lets not warn on this case to make the warning easier to adopt. This follows SG-16 guideline https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1 Fixes llvm#163719
9b3789a    to
    5c802f9      
    Compare
  
    | @cor3ntin (or anyone else). If you would like to add a note about this fix in the release notes (completely optional). Please reply to this comment with a one or two sentence description of the fix. When you are done, please add the release:note label to this PR. | 
UTF-16 to UTF-32 conversions seems widespread,
and lone surrogate have a distinct representation in UTF-32.
Lets not warn on this case to make the warning easier to adopt. This follows SG-16 guideline
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3695r2.html#changes-since-r1
Fixes #163719